Authors: Mauro Venticinque, Angelo Schillaci, Daniele Tambone
GitHub project: Bank-Marketing
Date: 2025-05-15
In this project, we analyze data from a Portuguese banking institution’s direct marketing campaigns to identify key factors influencing customer subscription to term deposits.
A deposit account is a bank account maintained by a financial institution in which a customer can deposit and withdraw money. Deposit accounts can be savings accounts, current accounts or any of several other types of accounts explained below.
The dataset includes client demographics, previous campaign interactions, and economic indicators. Our goal is to develop insights that will enhance the effectiveness of future marketing strategies. By applying supervised learning techniques, we aim to predict customer responses and optimize outreach efforts for better engagement and conversion rates.
The report will begin with an Exploratory Data Analysis, examining the variables and their relationship with the target attribute (subscribed) to identify the most influential factors.
* age (Integer): age of the customer
* job (Categorical): occupation
* marital (Categorical): marital status
* education (Categorical): education level
* default (Binary): has credit in default?
* housing (Binary): has housing loan?
* loan (Binary): has personal loan?
* contact (Categorical): contact communication type
* month (Categorical): last contact month of year
* day_of_week (Categorical): last contact day of the week
* duration (Integer): last contact duration, in seconds. Important note: this attribute highly affects the output target (e.g., if duration=0 then subscribed=‘no’). Yet, the duration is not known before a call is performed, and once the call ends the outcome is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
* campaign (Integer): number of contacts performed during this campaign and for this client (includes last contact)
* pdays (Integer): number of days that passed by after the client was last contacted from a previous campaign (999 means the client was not previously contacted, as the summary statistics below show)
* previous (Integer): number of contacts performed before this campaign and for this client
* poutcome (Categorical): outcome of the previous marketing campaign (‘failure’, ‘nonexistent’, ‘success’)
* subscribed (Binary): has the client subscribed a term deposit?

Source: UCI Machine Learning Repository

Note: our dataset does not include the bank balance variable.
| Name | train |
|---|---|
| Number of rows | 32950 |
| Number of columns | 21 |
| Column type frequency: | |
| character | 11 |
| numeric | 10 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| job | 0 | 1 | 6 | 13 | 0 | 12 | 0 |
| marital | 0 | 1 | 6 | 8 | 0 | 4 | 0 |
| education | 0 | 1 | 7 | 19 | 0 | 8 | 0 |
| default | 0 | 1 | 2 | 7 | 0 | 3 | 0 |
| housing | 0 | 1 | 2 | 7 | 0 | 3 | 0 |
| loan | 0 | 1 | 2 | 7 | 0 | 3 | 0 |
| contact | 0 | 1 | 8 | 9 | 0 | 2 | 0 |
| month | 0 | 1 | 3 | 3 | 0 | 10 | 0 |
| day_of_week | 0 | 1 | 3 | 3 | 0 | 5 | 0 |
| poutcome | 0 | 1 | 7 | 11 | 0 | 3 | 0 |
| subscribed | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| age | 0 | 1 | 40.04 | 10.45 | 17.00 | 32.00 | 38.00 | 47.00 | 98.00 | ▅▇▃▁▁ |
| duration | 0 | 1 | 258.66 | 260.83 | 0.00 | 102.00 | 180.00 | 318.00 | 4918.00 | ▇▁▁▁▁ |
| campaign | 0 | 1 | 2.57 | 2.77 | 1.00 | 1.00 | 2.00 | 3.00 | 43.00 | ▇▁▁▁▁ |
| pdays | 0 | 1 | 961.90 | 188.33 | 0.00 | 999.00 | 999.00 | 999.00 | 999.00 | ▁▁▁▁▇ |
| previous | 0 | 1 | 0.17 | 0.49 | 0.00 | 0.00 | 0.00 | 0.00 | 7.00 | ▇▁▁▁▁ |
| emp.var.rate | 0 | 1 | 0.08 | 1.57 | -3.40 | -1.80 | 1.10 | 1.40 | 1.40 | ▁▃▁▁▇ |
| cons.price.idx | 0 | 1 | 93.57 | 0.58 | 92.20 | 93.08 | 93.75 | 93.99 | 94.77 | ▁▆▃▇▂ |
| cons.conf.idx | 0 | 1 | -40.49 | 4.63 | -50.80 | -42.70 | -41.80 | -36.40 | -26.90 | ▅▇▁▇▁ |
| euribor3m | 0 | 1 | 3.62 | 1.74 | 0.63 | 1.34 | 4.86 | 4.96 | 5.04 | ▅▁▁▁▇ |
| nr.employed | 0 | 1 | 5167.01 | 72.31 | 4963.60 | 5099.10 | 5191.00 | 5228.10 | 5228.10 | ▁▁▃▁▇ |
The dataset includes 21 variables and 32,950 rows, with no missing values.
Categorical variables like job and education show good diversity (12 and 8 levels respectively), while default, loan, and housing have only 3 unique values.
Among numeric variables, age is roughly bell-shaped with a mild right skew (mean ≈ 40, sd ≈ 10), while duration and pdays are highly skewed, with extreme values up to 4918 and 999 respectively. For pdays, 999 is also the 25th percentile and the median, so it dominates the column rather than being a rare extreme.
Some variables (e.g., campaign, previous) have a low median but long tails, indicating that most observations are clustered at low values.
Macroeconomic variables such as emp.var.rate, euribor3m, and nr.employed have medians close to their maxima (1.10 vs 1.40, 4.86 vs 5.04, and 5191.00 vs 5228.10), suggesting that most contacts took place under similar economic conditions.
First, the dataset is imbalanced: the majority of clients did not subscribe.
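The imbalance can be quantified with a frequency table. The snippet below is a minimal sketch on a toy vector that stands in for train$subscribed; the values are hypothetical and do not reproduce the real class shares.

```r
# Toy stand-in for train$subscribed (hypothetical values, not the real data)
subscribed <- c(rep("no", 9), rep("yes", 1))

# Absolute and relative class frequencies
class_counts <- table(subscribed)
class_share  <- prop.table(class_counts)

print(class_counts)
print(round(class_share, 2))

# Ratio of majority to minority class, a quick measure of imbalance
imbalance_ratio <- max(class_counts) / min(class_counts)
print(imbalance_ratio)
```

On the real data, the same calls would be applied directly to train$subscribed.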
Correlation Matrix
The correlation matrix reveals clear patterns among the numerical variables.
Notably, euribor3m, nr.employed, and emp.var.rate are strongly positively correlated with each other, which suggests that these variables capture similar information about the economic environment. This should be taken into account in predictive modeling, as using them together could lead to multicollinearity. In contrast, variables like campaign, pdays, and previous show very weak correlations with most other features, indicating they may contribute more independently to the model.
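A quick way to flag candidate multicollinearity is to list the variable pairs whose absolute correlation exceeds a chosen cut-off. The sketch below uses simulated columns that only borrow the names of the real ones, and the 0.8 cut-off is an illustrative assumption, not a value taken from the report.

```r
set.seed(1)
n <- 200

# Simulated stand-ins for the real columns (hypothetical data): two of them
# share a common driver, so they are strongly correlated by construction.
driver <- rnorm(n)
toy <- data.frame(
  euribor3m   = driver + rnorm(n, sd = 0.1),
  nr.employed = driver + rnorm(n, sd = 0.1),
  campaign    = rnorm(n)
)

# Pearson correlation matrix of the numeric columns
cor_mat <- cor(toy[sapply(toy, is.numeric)])

# Keep each pair once (upper triangle) when |r| exceeds the cut-off
high <- which(abs(cor_mat) > 0.8 & upper.tri(cor_mat), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r    = cor_mat[high]
)
print(high_pairs)
```

Pairs returned by this check are candidates for dropping one member or for combining them before fitting a model.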
Scatterplot Matrix
The
scatterplot matrix confirms the distribution shape and
linearity of relationships among the numeric variables. Several
variables, such as duration and pdays,
show highly skewed distributions, which could influence
model performance and may benefit from transformations (e.g., log or
binning). While some variables exhibit linear trends (e.g., euribor3m vs
nr.employed), many scatterplots show dispersed or nonlinear patterns.
This suggests that simple linear models may not fully capture the
complexity in the data.
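The two transformations mentioned above can be sketched as follows. The vectors are hypothetical stand-ins for train$duration and train$pdays, the bin boundaries are illustrative, and reading 999 as "never contacted" is an assumption based on the summary table, where 999 is the dominant value.

```r
# Hypothetical values standing in for train$duration and train$pdays
duration <- c(0, 45, 102, 180, 318, 4918)
pdays    <- c(999, 999, 3, 6, 999, 12)

# log1p(x) = log(1 + x) compresses the long right tail and is defined at 0
log_duration <- log1p(duration)

# Bin pdays, treating 999 as a separate "never contacted" category (assumption)
pdays_bin <- ifelse(pdays == 999, "never_contacted",
             ifelse(pdays <= 7, "within_week", "over_week"))
pdays_bin <- factor(pdays_bin,
                    levels = c("never_contacted", "within_week", "over_week"))

print(round(log_duration, 2))
print(table(pdays_bin))
```

Binning pdays avoids treating the 999 code as a genuine number of days, which would otherwise distort any model that uses the raw value.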
Scatterplot Matrix by Target
From the diagonal histograms, we can see that most
people did not subscribe as pink dominates most
distributions. However, in certain plots, the blue
points (subscribed) are concentrated in specific areas, showing
the key factors that influenced successful subscriptions. The results of
this scatterplot matrix visually support the findings explored deeper in
the following exploratory data analysis.
Distribution of Age
The age distribution is right-skewed, with a peak around 30–40 years old.
The proportion of people who have subscribed is higher among those over 60. This may be due to greater financial stability in older age groups.
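The comparison of subscription rates across ages can be reproduced numerically by binning age and averaging the target within each bin. This is a minimal sketch on a hypothetical mini-sample; the cut points (30 and 60) are illustrative choices, not values used by the report.

```r
# Hypothetical mini-sample standing in for train$age and train$subscribed
age        <- c(22, 25, 34, 38, 41, 45, 52, 58, 63, 71)
subscribed <- c("yes", "no", "no", "no", "no", "no", "no", "no", "yes", "yes")

# Bin age into bands: [0, 30), [30, 60), [60, Inf)
age_band <- cut(age, breaks = c(0, 30, 60, Inf),
                labels = c("under_30", "30_to_60", "over_60"), right = FALSE)

# Share of "yes" within each band
rate_by_band <- tapply(subscribed == "yes", age_band, mean)
print(round(rate_by_band, 2))
```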
Distribution of Job
The distribution of occupations is not uniform: the largest group is admin. The subscription rate among admin clients is among the highest of all occupations, probably because they tend to have a higher income and are therefore more likely to subscribe. Students and retired people also show a higher proportion of subscriptions, which is consistent with the previous plot, where older clients subscribed more often.
Distribution of Education
The distribution of education level is not uniform: the largest group holds a university degree. The subscription rate among university graduates is among the highest across all education levels, probably because a university degree tends to go with a higher income and therefore a greater propensity to subscribe.
Distribution of Contacts
Regarding the previous campaign, most clients were not previously contacted, but the success rate is visibly higher among those who were previously contacted more than once or had a successful prior outcome. This suggests that prior engagement is positively associated with subscription, although these clients make up only a small part of the sample.
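Both points, the higher success rate and the small group size, can be read off a cross-tabulation. The sketch below uses hypothetical vectors in place of train$poutcome and train$subscribed.

```r
# Hypothetical stand-ins for train$poutcome and train$subscribed
poutcome   <- c("nonexistent", "nonexistent", "nonexistent", "nonexistent",
                "failure", "failure", "success", "success")
subscribed <- c("no", "no", "no", "yes", "no", "no", "yes", "yes")

# Counts per previous outcome and target value
tab <- table(poutcome, subscribed)

# Row-wise shares: subscription rate within each previous outcome
row_share <- prop.table(tab, margin = 1)

# Group sizes, to see how much of the sample each outcome covers
group_size <- rowSums(tab)

print(round(row_share, 2))
print(group_size)
```

Reporting the group sizes next to the rates makes it clear when a high rate rests on few observations.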
Distribution of Days of Week
The distribution of the last contact day of the week is roughly uniform, with Thursday the most frequent day. The subscription rate is highest when the last contact falls in the middle of the week.
Distribution of Months
In contrast, the distribution of the last contact month is not uniform: most clients were contacted in May. The subscription rate is highest when the last contact took place in March, December, September, or October. One possible explanation is that people are more likely to subscribe when they have more money available, and less likely during the summer.
Distribution of Duration
The duration of the last contact is right-skewed, with a peak around 0-100 seconds. The proportion of people who subscribed is higher among those with longer calls, probably because clients who stay on the phone longer are more interested in subscribing.
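Since duration is only known after the call, it is useful for benchmarking but should be left out of a realistic model, as the variable description notes. The sketch below compares median call length by outcome and then drops the column; the data frame is a hypothetical stand-in for train.

```r
# Hypothetical stand-in for a few columns of train
toy <- data.frame(
  duration   = c(30, 60, 95, 120, 400, 650, 900),
  campaign   = c(1, 2, 1, 3, 1, 2, 1),
  subscribed = c("no", "no", "no", "no", "yes", "no", "yes")
)

# Median call duration for each value of the target
median_by_target <- tapply(toy$duration, toy$subscribed, median)
print(median_by_target)

# Remove duration before fitting a model meant for use before the call
toy_realistic <- subset(toy, select = -duration)
print(names(toy_realistic))
```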
The Exploratory Data Analysis reveals several important insights into the factors that influence the likelihood of subscription in this dataset. The key findings are summarized below:
* The dataset is imbalanced, with the majority of contacted individuals not subscribing.
* Both younger and older individuals exhibit a higher likelihood of subscribing compared to those in middle age.
* Socio-demographic factors, such as education and job, appear to influence subscription rates. For example, individuals in administrative roles and those with higher education levels tend to subscribe more often.
* Prior interaction with the campaign, especially repeated contacts or past successful outcomes, is positively associated with subscription.
* Subscription rates vary by month, with peaks in March, December, September, and October.
* Longer call durations are linked to a higher likelihood of subscription.
* All economic variables examined show significant associations with subscription. Specifically, a lower consumer price index (CPI), a negative employment variation rate, and a higher consumer confidence index (CCI) are correlated with increased subscription rates.
In summary, the analysis suggests that financial conditions, previous campaign interactions, and macroeconomic indicators are strong predictors of subscription behavior. Demographic factors such as age, occupation, and education level also contribute meaningfully to the outcome.
In the next section, we will use these EDA findings to conduct a preliminary skim of the most influential variables, based on the visual trends observed in the plots.
train$job_student <- ifelse(train$job == "student", 1, 0)  # 0/1 indicators for the jobs highlighted in the EDA
train$job_retired <- ifelse(train$job == "retired", 1, 0)
train$job_admin <- ifelse(train$job == "admin.", 1, 0)
train$cons.price.idx <- ifelse(train$cons.price.idx < 93, 1, 0)  # binarize in place, then rename
names(train)[names(train) == "cons.price.idx"] <- "low_cpi"
train$cons.conf.idx <- ifelse(train$cons.conf.idx > median(train$cons.conf.idx), 1, 0)  # the column must be referenced through train$
names(train)[names(train) == "cons.conf.idx"] <- "high_cci"
train$euribor3m <- ifelse(train$euribor3m < mean(train$euribor3m), 1, 0)  # same fix: mean() needs train$euribor3m
names(train)[names(train) == "euribor3m"] <- "low_euribor"
train$emp.var.rate <- ifelse(train$emp.var.rate < 0, 1, 0)
names(train)[names(train) == "emp.var.rate"] <- "negative_emp"
train$university <- ifelse(train$education == "university.degree", 1, 0)  # education indicators, created before the column is dropped
train$p_course <- ifelse(train$education == "professional.course", 1, 0)
train <- subset(train, select = -education)
train$month_sep <- ifelse(train$month == "sep", 1, 0)  # months with the highest subscription rates
train$month_oct <- ifelse(train$month == "oct", 1, 0)
train$month_dec <- ifelse(train$month == "dec", 1, 0)
train$month_mar <- ifelse(train$month == "mar", 1, 0)
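Data-dependent thresholds such as the mean of euribor3m are best computed once and stored, so that the same cut-off can be applied to any other data later. The sketch below assumes a separate test set exists, which the report does not state; train_toy, test_toy, and make_low_flag are hypothetical names introduced only for illustration.

```r
# Hypothetical train/test fragments (not the real data)
train_toy <- data.frame(euribor3m = c(0.7, 1.3, 4.8, 4.9, 5.0))
test_toy  <- data.frame(euribor3m = c(1.0, 4.95))

# Compute the cut-off once, from the training data only
euribor_cut <- mean(train_toy$euribor3m)

# Hypothetical helper: 1 when the value lies below the cut-off, else 0
make_low_flag <- function(x, cutoff) as.integer(x < cutoff)

# Apply the same training cut-off to both sets, keeping the original column
train_toy$low_euribor <- make_low_flag(train_toy$euribor3m, euribor_cut)
test_toy$low_euribor  <- make_low_flag(test_toy$euribor3m, euribor_cut)

print(euribor_cut)
print(train_toy)
print(test_toy)
```

Writing the flag to a new column, rather than overwriting and renaming, also keeps the raw value available for later checks.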
Social and economic context attributes:
* emp.var.rate (Numeric): employment variation rate - quarterly indicator
* cons.price.idx (Numeric): consumer price index - monthly indicator
* cons.conf.idx (Numeric): consumer confidence index - monthly indicator
* euribor3m (Numeric): euribor 3 month rate - daily indicator
* nr.employed (Numeric): number of employees - quarterly indicator